The dataset I use contains data on the quality and chemical properties of red wines. It was already tidy, so no cleaning was needed before beginning the analysis. I chose this dataset because…well…I like red wine!
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
The dataset has 1,599 observations and 13 variables. One of them, X, is simply a row index. The others measure chemical properties of the wines, plus each wine’s quality as rated by a panel of wine experts. All variables are numeric, stored either as decimals (num) or integers (int).
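The structure and preview above are the kind of output that `str()` and `head()` produce. Here is a minimal sketch of that inspection step. The CSV file name in the comment is an assumption, since the report does not name the file. So that the example runs without the file, it rebuilds a small stand-in data frame (`wine_head`, a hypothetical name) from the first ten values of a few columns shown in the `str()` output.

```r
# With the full file one would read it in first (the file name is an assumption):
# wine <- read.csv("wineQualityReds.csv")

# Toy stand-in: first ten values of selected columns, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)                 # column types: num for measurements, int for quality
dim(wine_head)                 # rows and columns of the toy frame
sapply(wine_head, is.numeric)  # confirms every column is numeric
```

On the full data, the same three calls would report the observation count, the variable count, and the column types described above.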
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The most important variable to me is the quality of the wine. People don’t decide whether to buy a wine based on its citric acid content, but they do based on its quality. Examining this variable is a good place to start.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Most wines have a quality rating of either 5 or 6: together these two ratings account for 1,319 of the 1,599 wines, or about 82 percent.
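To see how concentrated the ratings are, the counts in the table can be turned into shares. This is a small sketch that uses only the six counts printed above; the variable names are illustrative.

```r
# Counts per quality rating, copied from the table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total_wines  <- sum(quality_counts)                            # number of wines
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / total_wines # share rated 5 or 6

round(100 * prop.table(quality_counts), 1)  # percent of wines at each rating
round(100 * share_5_or_6, 1)                # percent rated 5 or 6
```

With the full data frame loaded, `prop.table(table(wine$quality))` would give the same shares directly, assuming the data frame is called `wine`.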
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
From the histogram we can see that the distribution of alcohol content is skewed to the right: there is a long tail of wines with alcohol contents well above the median.
The boxplot reveals that there are some outliers above an alcohol content of 13 percent.
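Where a boxplot starts drawing points as outliers can be approximated from the quartiles. The sketch below assumes the common 1.5 × IQR rule (the default `coef = 1.5` in base R's `boxplot.stats()`) and uses the rounded alcohol quartiles from the full summary table earlier in the report, so the result is approximate.

```r
# Alcohol quartiles and extremes, copied from the full summary table.
q1          <- 9.5
q3          <- 11.1
alcohol_min <- 8.4
alcohol_max <- 14.9

iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr  # values above this are drawn as high outliers
lower_fence <- q1 - 1.5 * iqr  # values below this are drawn as low outliers

c(lower = lower_fence, upper = upper_fence)
alcohol_max > upper_fence  # TRUE: there is at least one high outlier
alcohol_min < lower_fence  # FALSE: there are no low outliers
```

Under these assumptions the upper fence sits near 13.5 percent, which is consistent with the outliers appearing only at the high end of the plot.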
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is approximately normally distributed. From the summary you can see that the mean and median of this variable are almost the same. Wine is acidic, meaning that it has a pH of less than 7.
### Sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
There are many high outliers in sulphates, and the data are skewed to the right. When faceted by quality, the boxplots differ in shape and in the number of outliers. Qualities 5 and 6 appear to have more outliers, while the higher and lower qualities have fewer; this is partly expected, because those two ratings contain most of the wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
When you look at a boxplot of citric acid, the spread of the data looks relatively innocuous, but the histogram shows that the data are all over the place. There are two spikes around 0.0 and 0.5, with the data rising and falling in between. The boxplot shows almost no outliers.
I am interested in quality; what goes into making a good bottle of wine? How does the chemical composition of wine differ between the bad, average, and great wines?
One interesting approach was faceting variables by quality when making charts, to see whether any differences appeared.
I don’t know yet. However, I may have to learn more than I ever thought I would about the chemical composition of wine to see which variables most influence its quality.
I haven’t created any new variables yet; I have only explored existing ones.
The distribution of the citric acid variable was unusual. All others were either normally distributed or skewed to the right.
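A quick numeric check on the shape of a distribution is to compare its mean and median: for a right-skewed variable the mean is usually pulled above the median, while for a roughly symmetric one the two nearly coincide. This is only a heuristic, not a formal test. The sketch uses values copied from the summary table, and the column names are illustrative.

```r
# Median and mean for three variables, copied from the summary table.
summary_stats <- data.frame(
  variable = c("alcohol", "pH", "sulphates"),
  median   = c(10.20, 3.310, 0.6200),
  mean     = c(10.42, 3.311, 0.6581)
)

# Gap between mean and median, also expressed relative to the median.
summary_stats$gap          <- summary_stats$mean - summary_stats$median
summary_stats$relative_gap <- summary_stats$gap / summary_stats$median

summary_stats
```

The relative gap is smallest for pH and noticeably larger for alcohol and sulphates, which agrees with the shapes described in the text.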
The million-dollar question: does better wine get you more drunk?
There appears to be a positive correlation between the two variables: higher quality wines tend to have higher alcohol contents.
## [1] 0.2056325
There is a weak positive correlation between alcohol and pH (about 0.21).
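Single numbers like the one above are what `cor()` returns for a pair of columns. The sketch below shows the call on the first ten alcohol and pH values from the `str()` output. Ten rows are far too few to reproduce the full-data coefficient, so the point is the mechanics rather than the result.

```r
# First ten values of each column, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
ph      <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

r_pearson  <- cor(alcohol, ph)                       # default method is "pearson"
r_spearman <- cor(alcohol, ph, method = "spearman")  # rank-based alternative

c(pearson = r_pearson, spearman = r_spearman)
```

If a significance test and confidence interval are wanted as well, `cor.test(alcohol, ph)` reports them along with the coefficient.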
## [1] -0.4961798
There is a negative correlation between the two variables, with a coefficient of about -0.5. There are cases where a wine with a higher density has more alcohol than one with a lower density, but the general trend is that wines with more alcohol are less dense. This is because alcohol is less dense than water.
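The physical reasoning can be illustrated with a back-of-the-envelope calculation. This sketch treats wine as an idealized blend of only water and ethanol, using rounded room-temperature densities. It ignores sugar, acids, and the slight volume contraction of real mixtures, so the absolute numbers will not match the dataset; only the downward trend is the point. The function name `blend_density` is hypothetical.

```r
# Approximate densities in g/mL near 20 degrees C (rounded textbook values).
density_water   <- 0.998
density_ethanol <- 0.789

# Idealized volume-weighted blend; real wine also contains sugar and acids.
blend_density <- function(alcohol_pct) {
  frac <- alcohol_pct / 100
  frac * density_ethanol + (1 - frac) * density_water
}

# Minimum, median, and maximum alcohol content from the summary table.
blend_density(c(8.4, 10.2, 14.9))
```

Each additional percentage point of alcohol lowers the idealized density by roughly 0.002 g/mL, which matches the direction of the correlation.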
## [1] -0.3416993
There is a moderate negative correlation between these variables (about -0.34).
## [1] -0.05773139
Wines with a lower pH are more acidic. The presence of acids in the wine influences taste, so more or less acid may influence its perceived quality. However, the data show almost no correlation between the two variables (about -0.06). The level of acidity, as measured by pH, does not seem to affect quality.
Faceted by quality, the normal shape seen in the univariate histogram remains. Wines of each quality level have pH values that are approximately normally distributed and fall in similar ranges.
## [1] 0.2263725
There are ‘bars’ of quality in this graph. They look cool, but they are just a side effect of every wine having an integer quality rating: the ‘jitter’ parameter spreads the points slightly around each integer.
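The banding can be reproduced in miniature. The sketch below uses base R's `jitter()` as a stand-in for whatever jitter option the plot used (an assumption, since the plotting code is not shown) and applies it to the first ten quality ratings from the `str()` output.

```r
set.seed(1)  # only so the example is repeatable

# First ten quality ratings, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# With amount = 0.2, jitter() adds uniform noise between -0.2 and 0.2.
quality_jittered <- jitter(quality, amount = 0.2)

round(quality_jittered, 2)

# Rounding recovers the original ratings, so each band stays separate.
all(round(quality_jittered) == quality)
```

Because the noise is smaller than half the distance between adjacent ratings, the points form distinct horizontal bands instead of one continuous cloud.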
Zooming into the graph by subsetting on quality, the citric acid content is lower for the lower quality wines. This could be due to the raters’ individual preferences, or due to something more structural about the chemical composition of the wine.
## [1] 0.2513971
There is a weak positive correlation between the level of sulphates in a wine and its quality (about 0.25).
## [1] 0.1240516
## [1] -0.3905578
There is a weak correlation between fixed acidity and quality. When you remove the highest and lowest rated wines, that relationship becomes clearer.
Volatile acidity refers to the presence of steam-distillable acids in the wine. Wine spoilage is measured by volatile acidity. Higher quantities of volatile acidity may indicate spoilage, and thus reduce the quality of the wine. That could explain the negative correlation seen between volatile acidity and quality.
The legal limit for volatile acidity in red wine in the United States is 1.4 grams per liter. Sure enough, almost all of the wines have volatile acidities below this amount.
## [1] -0.06940835
## [1] -0.2056539
There is a weak negative correlation between sulfur dioxide and alcohol. Upon further research, however, it doesn’t appear that the factors are linked. Some sulfur dioxide is produced during the fermentation process, but most of it in wine is added by winemakers as a preservative. It doesn’t play a role in the creation of alcohol.
I saw a relationship between volatile acidity and quality, as well as among factors like pH, alcohol, and density. I also saw a relationship between alcohol content and quality.
I also noticed relationships between density and alcohol content, and between pH and density.
The strongest relationship I found was between density and alcohol content. The more alcohol a wine has, the less dense it is. This is because alcohol is less dense than water.
At a given level of alcohol content, there isn’t any clear relationship between citric acid and quality. The correlation is negative for the highest and lowest qualities, almost zero for quality 5, and slightly positive for quality 6.
There is a bit of a relationship between alcohol and fixed acidity. For most qualities of wine, the correlation between fixed acidity and alcohol is negative.
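Correlations within each quality level can be computed by splitting the data and applying `cor()` to each piece. The sketch below does this on a toy data frame built from the first ten rows shown in the `str()` output (`wine_head` is a hypothetical name). With so few rows only one quality level has enough wines, so this shows the mechanics and does not reproduce the findings.

```r
# Toy data: first ten values of each column, copied from the str() output.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One data frame per quality level.
by_quality <- split(wine_head, wine_head$quality)

# Keep only levels with at least three wines; fewer points say nothing useful.
by_quality <- by_quality[sapply(by_quality, nrow) >= 3]

# Correlation between fixed acidity and alcohol within each remaining level.
cor_by_quality <- sapply(by_quality, function(d) cor(d$fixed.acidity, d$alcohol))
cor_by_quality
```

On the full dataset the same code would return one coefficient per quality level, which can then be compared across levels.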
In this section I tried to see the relationship between quality and the factors that contribute to higher alcohol content in wine. When there is less fixed acidity in the wine, there tends to be more alcohol.
In this section there weren’t any findings that really jumped out at me.
I didn’t create any models. Many factors influence the winemaking process, and I would be hesitant to create a model that says that just a handful of them can predict the quality of a wine.
This is a histogram of the ratings the wine experts gave the wines in the dataset. I chose this plot because it shows the variable that I wanted to find out more about. What makes wine good?
I decided to show this plot as a final plot because alcohol is a variable that is correlated with quality in a dataset where there weren’t many variables correlated with quality. It could be used to further investigate factors that make wine good.
This is a scatterplot of alcohol content and fixed acidity. There are linear regression lines for each level of wine quality. This chart shows a variable that may influence the level of alcohol, which we have shown is positively correlated with quality. This chart shows a starting point for those interested in further exploring what factors influence the quality of wine.
After analyzing this dataset, I’ve come to realize that winemaking is more art than science. There are thousands of factors that go into the flavor of a bottle of wine, most of which are not captured in the dataset. Also, people perceive taste differently and prefer some tastes over others. Had another group of experts rated the wines, we might have different results.
This is shown in the dataset by the lack of correlation between quality and other factors, such as acidity. Acidity is one of the factors that influences a wine’s taste, but for each level of pH or fixed acidity, there isn’t a clear relationship between those variables and quality. There was a correlation between quality and alcohol content, and between fixed acidity and alcohol, but I think that there are other factors that influence this relationship. You can’t just dump a bunch of alcohol in a batch of wine or add acid to the wine to make it taste better! So, I wouldn’t go as far as to say that more alcohol or acid makes wine taste better.
I’d also be hesitant to rely much on any mathematical model to judge whether wine is good or not, because of the number of factors and subjective nature of quality.
One way this dataset could be improved is by including the region in which the wine grapes were grown, as well as climate data for the growing season. Climate and region are very important factors; they affect acidity and alcohol content, which modify a wine’s taste. Adding these variables to the dataset may uncover more patterns in the interaction of the components that make up wine. The researchers who collected the data and constructed the dataset declined to include this sort of data for privacy reasons, but having access to it may yield interesting insights.
The successes of the analysis were finding some correlations and learning a bit more about wine, which was fun. However, we have to keep in mind that correlation doesn’t imply causation and that there are many factors that could cause the correlations that we see, or they could even just be random. A struggle for me was constantly reminding myself that there may be more to the correlations than the chart shows, or that they might not even mean anything.